Coherent Gene Expression Pattern Finding Using Clustering Approaches
Not recorded
Abstract
The analysis of gene expression data is an important research area in DNA microarray research. Data mining techniques have proven useful for understanding gene function, gene regulation, cellular processes and subtypes of cells. Most data mining algorithms developed for gene expression data address the problem of clustering. The purpose of this thesis is to study different clustering approaches for gene expression data. Our first contribution is a dissimilarity measure (DBK) that retains regulation information and is robust to outliers. We have also developed a graph-based clustering algorithm (GCA) for gene expression data. Its main idea is that inter-cluster genes repel one another more strongly than intra-cluster genes. In particular, at any given moment, genes are clustered according to a repulsion factor computed from the genes that have not yet been assigned to a cluster. This consideration leads to an objective function, and the cluster parameter is chosen to optimize it. A comparison of GCA with competing algorithms on several real-world data sets shows that our approach is superior. We have also developed a nearest-neighbor-based clustering algorithm that incorporates frequent itemset mining (FINN). The output of the frequent itemset mining phase is fed as input to the nearest-neighbor clustering phase, which detects the clusters. The process is iterated over multiple passes; after each pass, the data set is pruned by excluding the genes that have already been assigned to clusters. Experimental evaluation shows that the method is capable of finding a finer clustering of the data set. This thesis also includes a density-based clustering algorithm (DGC), which uses the regulation information and the order-preserving property of gene expression profiles to cluster genes into high-density regions separated by sparse regions. The proposed algorithm has been validated on several real-life data sets and found to perform well in comparison with similar algorithms. This thesis also presents an incremental version of the DGC algorithm (incDGC). Experimental results on six real-world gene expression data sets demonstrate that incDGC clusters the data efficiently while obtaining the same result as applying DGC to the whole updated database. All clustering algorithms have been validated using various statistical measures to show their effectiveness.
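The abstract does not define DBK or say how regulation information is encoded, so the following is only a rough illustration of the general idea, not the thesis's measure. It assumes that "regulation" means whether a gene's expression goes up, goes down or stays flat between consecutive conditions. The function names and the `threshold` parameter are hypothetical.

```python
from typing import List, Sequence


def regulation_pattern(profile: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Encode each consecutive pair of conditions as +1 (up), -1 (down) or 0 (flat).

    `threshold` is an assumed cutoff below which a change counts as "no change".
    """
    pattern = []
    for prev, curr in zip(profile, profile[1:]):
        delta = curr - prev
        if delta > threshold:
            pattern.append(1)
        elif delta < -threshold:
            pattern.append(-1)
        else:
            pattern.append(0)
    return pattern


def pattern_mismatch(a: Sequence[float], b: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of condition transitions at which two genes are regulated differently.

    This is an illustrative dissimilarity in [0, 1]; it is NOT the DBK measure.
    """
    pa = regulation_pattern(a, threshold)
    pb = regulation_pattern(b, threshold)
    if len(pa) != len(pb):
        raise ValueError("profiles must cover the same number of conditions")
    if not pa:
        return 0.0
    return sum(x != y for x, y in zip(pa, pb)) / len(pa)


if __name__ == "__main__":
    gene_a = [1.0, 2.0, 2.1, 0.5]
    gene_b = [0.0, 3.0, 3.0, 1.0]
    print(regulation_pattern(gene_a))
    print(pattern_mismatch(gene_a, gene_b))
```

Because only the direction of each change is compared, two genes with very different magnitudes but the same up/down behaviour come out as similar, and a single extreme value shifts the result less than it would under a Euclidean distance.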
Similar resources
Application of Clustering Methods in DNA Microarrays
Background: DNA microarray technology has paved the way for investigators to study the expression of thousands of genes in a short time. Analysis of this large amount of raw data includes normalization, clustering and classification. The present study surveys the application of clustering techniques in DNA microarray analysis. Materials and methods: We analyzed data from the Van’t Veer et al. study dealing with BRCA1...
Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis
Recognizing genes with distinctive expression levels can help in the prevention, diagnosis and treatment of diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering gene expression datasets. Fast GKM is a significant improvement over the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...
Finding Exact and Solo LTR-Retrotransposons in Biological Sequences Using SVM
Finding repetitive subsequences in a genome is a challenging problem in bioinformatics research. Many approaches have been proposed to solve the problem, and they can be divided into library-based and de novo methods. Library-based methods use predetermined repetitive subsequences of the genome, whereas library-less methods attempt to discover repetitive subsequences by analytical approach...
GPX: Interactive Mining of Gene Expression Data
Discovering co-expressed genes and coherent expression patterns in gene expression data is an important data analysis task in bioinformatics research and biomedical applications. Although various clustering methods have been proposed, two tough challenges remain: how to integrate users’ domain knowledge and how to handle the high connectivity in the data. Recently, we have systemati...
Linear Coherent Bi-cluster Discovery via Beam Detection and Sample Set Clustering
We propose a new bi-clustering algorithm, LinCoh, for finding linear coherent bi-clusters in gene expression microarray data. Our method exploits a robust technique for identifying conditionally correlated genes, combined with an efficient density based search for clustering sample sets. Experimental results on both synthetic and real datasets demonstrated that LinCoh consistently finds more ac...